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Abstract. MCMC algorithms such as Metropolis-Hastings algorithms 
are slowed down by the computation of complex target distributions as 
exemplified by huge datasets. We offer in this paper a useful generali¬ 
sation of the Delayed Acceptance approach, devised to reduce the com¬ 
putational costs of such algorithms by a simple and universal divide- 
and-conquer strategy. The idea behind the generic acceleration is to 
divide the acceptance step into several parts, aiming at a major re¬ 
duction in computing time that out-ranks the corresponding reduction 
in acceptance probability. Each of the components can be sequentially 
compared with a uniform variate, the first rejection signalling that the 
proposed value is considered no further. We develop moreover theo¬ 
retical bounds for the variance of associated estimators with respect 
to the variance of the standard Metropolis-Hastings and detail some 
results on optimal scaling and general optimisation of the procedure. 
We illustrate those accelerating features on a series of examples. 


Keywords: Large Scale Learning and Big Data, MCMC, likelihood function, 
acceptance probability, mixtures of distributions, Jeffreys prior 

1. INTRODUCTION 

When running an MCMC sampler such as Metropolis-Hastings algorithms 
(Robert and Casella, 2004), the complexity of the target density required by the 
acceptance ratio may lead to severe slow-downs in the execution of the algorithm. 
A direct illustration of this difficulty is the simulation from a posterior distribu¬ 
tion involving a large dataset of n points for which the computing time is at least 
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of order O(n). Several solutions to this issue have been proposed in the recent 
literature (Korattikara et al., 2013, Neiswanger et ah, 2013, Scott et al., 2013, 
Wang and Dunson, 2013), taking advantage of the likelihood decomposition 

n 

(1) n« 

i =1 

to handle subsets of the data on different processors (CPU), graphical units 
(GPU), or even computers. However, there is no consensus on the method of 
choice, some leading to instabilities by removing most prior inputs and others to 
approximations delicate to evaluate or even to implement. 

Our approach here is to delay acceptance (rather than rejection as in Tierney 
and Mira (1998)) by sequentially comparing parts of the MCMC acceptance ratio 
to independent uniforms, in order to stop earlier the computation of the aforesaid 
ratio, namely as soon as one term is below the corresponding uniform. 

More formally, consider a generic Metropolis-Hastings algorithm where the 
acceptance ratio Tr(y)q(y,x)/n(x)q(x,y) is compared with a U(0, 1) variate to 
decide whether or not the Markov chain switches from the current value x to the 
proposed value y (Robert and Casella, 2004). If we now decompose the ratio as 
an arbitrary product 

d 

(2) ir(y) q{y , x)/ir(x)q(x, y) = JJ p k {x, y) 

k=X 

where the only constraint is that the functions p k are all positive and satisfy the 
balance condition p k (x, y) = p k (y, x ) _1 and then accept the move with probability 

d 

( 3 ) n min {p k {x,y), 1} , 

k =l 

i.e. by successively comparing uniform variates u k to the terms p k (x, y), the moti¬ 
vation for our delayed approach is that the same target density 7r(-) is stationary 
for the resulting Markov chain. 

The mathematical validation of this simple if surprising result can be seen 
as a consequence of Christen and Fox (2005). This paper reexamines Fox and 
Nicholls (1997), where the idea of testing for acceptance using an approximation 
and before computing the exact likelihood was first suggested. In Christen and 
Fox (2005), the original proposal density q is used to generate a value y that is 
tested against an approximate target 7f. If accepted, y can be seen as coming from 
a pseudo-proposal q that simply is formalising the earlier preliminary step and is 
then tested against the true target n. The validation in Christen and Fox (2005) 
follows from standard detailed balance arguments; we will focus formally on this 
point in Section 2. 

In practice, sequentially comparing those probabilities with d uniform variates 
means that the comparisons stop at the first rejection, implying a gain in com¬ 
puting time if the most costly items in the product (2) are saved for the final 
comparisons. 

Examples of the specific two-stage Delayed Acceptance as defined by Christen 
and Fox (2005) can be found in Golightly et al. (2014), in the pMCMC context, 
and in Shestopaloff and Neal (2013). 
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Figure 1 . Fit of a two-step Metropolis-Hastings algorithm applied to a normal-normal posterior 
distribution p\x ~ N(x/({ 1 + cr“ 2 }, 1/{1 + er“ 2 }) when x = 3 and = 10, based on T = 10 5 
iterations and a first acceptance step considering the likelihood ratio and a second acceptance 
step considering the prior ratio, resulting in an overall acceptance rate of 12% 



Figure 2. (left) Fit of a multiple-step Metropolis- Hastings algorithm applied to a Beta-binomial 
posterior distribution p\x ~ Be(x + a, n + b — x) when N = 100, x = 32, a = 7.5 and b = .5. The 
binomial B(N,p ) likelihood is replaced with a product of 100 Bernoulli terms and an acceptance 
step is considered for the ratio of each term. The histogram is based on 10 5 iterations, with an 
overall acceptance rate of 9%; (centre) raw sequence of successive values of p in the Markov 
chain simulated in the above experiment; (right) autocorrelogram of the above sequence. 


The major drawback of the scheme is that Delayed Acceptance efficiently re¬ 
duces the computing cost only when the approximation n is “good enough” or 
“flat enough”, since the probability of acceptance of a proposed value will always 
be smaller than in the original Metropolis-Hastings scheme. In other words, the 
original Metropolis-Hastings kernel dominates the new one in Peskun’s (Peskun, 
1973a) sense. The most relevant question raised by Christen and Fox (2005) is 
thus how to achieve a proper approximation; note in fact that while in Bayesian 
statistics a decomposition of the target is always available, by breaking original 
data in subsamples and considering the corresponding likelihood parts or even 
by just separating the prior, proposal and likelihood ratio into different factors, 
these decompositions may just lead to a deterioration of the algorithm properties 
without impacting the computational efficiency. 

However, even in these simple cases, it is possible to find examples where 
Delayed Acceptance may be profitable. Consider for instance resorting to a costly 
non-informative prior distribution (as illustrated in Section 5.3 in the case of 























































































4 


mixtures); here the first acceptance step can be solely based on the ratio of the 
likelihoods and the second step, which involves the ratio of the priors, does not 
require to be computed when the first test leads to rejection. Even more often, 
the converse decomposition applies to complex or just costly likelihood functions, 
in that the prior ratio may first be used to eliminate values of the parameter that 
are too unlikely for the prior density. As shown in Figure 1, a standard normal- 
normal example confirms that the true posterior and the histogram resulting from 
such a simulated sample are in agreement. 

In more complex settings, as for example in “Big Data” settings where the like¬ 
lihood is made of a very large number of terms, the above principle also applies 
to any factorisation of the like of (1) so that each individual likelihood factor can 
be evaluated separately. This approach increases both the variability of the eval¬ 
uation and the potential for rejection, but, if each term of the factored likelihood 
is sufficiently costly to compute, the decomposition brings some improvement 
in execution time. The graphs in Figure 2 illustrate an implementation of this 
perspective in the Beta-binomial case, namely when the binomial B(N,p ) obser¬ 
vation x = 32 is replaced with a sequence of N Bernoulli observations. The fit is 
adequate on 10 5 iterations, but the autocorrelation in the sequence is very high 
(note that the ACF is for the 100 times thinned sequence) while the acceptance 
rate falls down to 9%. (When the original y = 32 observation is (artificially) 
divided into 10, 20, 50, and 100 parts, the acceptance rates are 0.29, 0.25, 0.12, 
and 0.09, respectively.) The gain in using this decomposition is only appearing 
when each Bernoulli likelihood computation becomes expensive enough. 

On one hand, the order in which the product (3) is explored determines the 
computational efficiency of the scheme, while, on the other hand, it has no impact 
on the overall convergence of the resulting Markov chain, since the acceptance of 
a proposal does require computing all likelihood values. It therefore makes sense 
to try to optimise this order by ranking the entries in a way that improves the 
execution speed of the algorithm (see Section 3.2). 

We also stress that the Delayed Acceptance principle remains valid even when 
the likelihood function or the prior are not integrable over the parameter space. 
Therefore, the prior may well be improper. For instance, when the prior distribu¬ 
tion is constant, a two-stage acceptance scheme reverts to the original Metropolis- 
Hastings one. 

Finally, while the Delayed Acceptance methodology is intended to cater to 
complex likelihoods or priors, it does not bring a solution per se to the “Big Data” 
problem in that (a) all terms in the product must eventually be computed; and (b) 
all terms previously computed (i.e., those computed for the last accepted value of 
the parameter) must be either stored for future comparison or recomputed. See, 
e.g., Scott et al. (2013), Wang and Dunson (2013), for recent entries on different 
parallel ways of handling massive datasets. 

The plan of the paper is as follows: in Section 2, we validate the decomposition 
of the acceptance step into a sequence of decisions, arguing about the compu¬ 
tational gains brought by this generic modification of Metropolis-Hastings algo¬ 
rithms and further analysing the relation between the proposed method and the 
Metropolis-Hastings algorithm in terms of convergence properties and asymptotic 
variances of statistical estimates. In Section 4 we briefly state the relations be¬ 
tween Delayed Acceptance and other methods present in the literature. In Section 
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3 we aim at giving some intuitions on how to improve the behaviour of Delayed 
Acceptance by ranking the factors in a given decomposition to achieve optimal 
computational efficiency and finally give some preliminary results in terms of 
optimal scaling for the proposed method. Then Section 5 studies Delayed Accep¬ 
tance within three realistic environments, the first one made of logistic regression 
targets, the second one alleviating the computational burden from a Geometric 
Metropolis adjusted Langevin algorithm and a third one handling an original 
analysis of a parametric mixture model via genuine Jeffreys priors. Section 6 
concludes the paper. 


2. VALIDATION AND CONVERGENCE OF DELAYED ACCEPTANCE 

In this section, we establish that Delayed Acceptance is a valid Markov chain 
Monte Carlo scheme and analyse on a theoretical basis the differences with the 
original version. 

2.1 The general scheme 

We assume for simplicity that the target distribution n and the proposal distri¬ 
butions Q(x, •) all admit densities w.r.t. the Lebesgue or counting measures. We 
also denote by tt the target density and let q(x, y ) denote the proposal density. 

Let (A! n ) n >i be a Markov chain evolving on X with Metropolis-Hastings Markov 
transition kernel P associated with q and vr, i.e. for A € £>(X) 


P{x,A) := j q(x,y)oc(x,y)dy + ^1 - j q(x,y)a(x,y)dy S j 1 A {x), 


where 


a(x,y) := 1 A r(x,y), 


r(x,y) 


n{y)g{y,x) 
tt (x)q(x,y)' 


Above, a(x, y) is known as the Metropolis-Hastings acceptance probability and 
r(x, y) as the Metropolis-Hastings acceptance ratio. 

We consider the class of “Delayed acceptance” Markov kernels associated with 
P, which are defined by factorisations of the function r as 


d 

(4) r(x,y) = Y[pk(x,y) 

k=1 

with all components in the product satisfying pk(x,y) = pk{y,x)~ l . The associ¬ 
ated Delayed Acceptance Markov kernel is then defined as 


P(x,A):= / q(x,y)a(x,y)dy + 1- / q(x,y)a(x,y)dy) l A (x), 


where 


d 

a(x,y) := A pk(x, y)}. 
k=1 


We will denote by (X n ) n >i the Markov chain associated with P. 

The order in which the sequence of functions pk appears in the factorisation 
(4) is important for algorithmic specification, as can be seen in Algorithm 1. It 
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means that pi(x, Y) is evaluated first, p2(x, Y) second, and so on until pd(x, Y) = 
r(x, Y)/ Hfc=i Pk(x, Y) which is last, with the motivation that “early rejection” 
can allow computational savings by avoiding the computation of the subsequent 
Pk(x,Y). 

Algorithm 1 Delayed Acceptance 

To sample from P[x,-)\ 

1. Sample Y ~ Q(x, •)■ 

2. For k = 1,..., d: 

• With probability 1 A pk(x,Y) continue, otherwise stop and output x. 

3. Output Y. 


2.2 Validation 

The first lemma is a standard representation leading to the validation of the 
Delayed Acceptance Markov chain: 

Lemma 1. For any Markov chain with transition kernel II of the form 


U(x, A) = q(x, y)a(x , y)dy + (\ - q{x, y)a(x , y)dy^ 1 A (x), 

and satisfying detailed balance, the function a(-) satisfies (for n-a.a. x,y) 

a (x, y) , . 

— -r = r{x,y). 

a[y, x) 

Proof. This follows immediately from the detailed balance condition 
n(x)q(x,y)a(x,y) = ir(y)q(y,x)a(y,x). 


□ 


The Delayed Acceptance Markov chain (A n ) n >i is then associated with the 
intended target: 

Lemma 2. (A” n ) n >i is a ir-reversible Markov chain. 

PROOF. From Lemma 1 it suffices to verify that a(x,y)/a(y,x) = r{x,y). 
Indeed, we have 

a(x,y) = nLil 1 Apfc(s,j/)} 

Hy,x) nLi{lAp fc (y,x)} 

tt 1 A p k {x,y) 

1 A p k (y, x) 
d 

= Y\pk(x,y) = r(x,y), 
k=1 

since p k (y, x) = p k (x, y) -1 and (1 A a)/(l A a -1 ) = a for a £ R+. 
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Remark 1. It is immediate to show that 
d d 

d{x,y) = JJ{1 A pk(x,y)} < 1 A J ~[pk(x,y) = 1 A r(x,y) = a(x,y), 

k= 1 k= 1 

since (1 A a)(l A b) < (1 A ah) for a,b E M+. 


2.3 Comparisons of the kernels P and P 

Given a probability measure y, let us denote 

M/) : = f(x)y(dx) , L 2 ( E, y) := {/ : y(f 2 ) < oo} 

^o(E^) :={/GL 2 (E,/i) :) u(/)=0}. 

For a generic Markov kernel II : E x 6(E) with unique invariant probability 
measure y, we define 

var(/, n) := lirn var ( n"5 V [/(A*) - /i(/)] J , 

n—>■ oo \ z —^ / 

\ 2=1 / 

where (X n ) n >i is a Markov chain with Markov kernel II initialised with X\ ~ y. 

Remark 2. One can immediately conclude from the construction of P that 
var (f,P) < var(/, P) /or any f E L 2 (X, 7r), using Peskun ordering (Peskun, 
1973b, Tierney, 1998), since a(x,y) < a(x,y) for any (x,y) E X 2 . 


For any / E L 2 (E, y) we define the Dirichlet form associated with a /i-reversible 
Markov kernel II : E x £>(E) as 

£n (/) := \ J h(dx)U(x,dy) [/(x) - /(y)] 2 . 


The (right) spectral gap of a generic /i-reversible Markov kernel has the following 
variational representation 


Gap (n) := 


■ f £n (/) 
(fj) u 


which leads to the following domination lemma: 


Lemma 3 ((Andrieu et ah, 2013, Lemma 34)). Let ni and II 2 be y-reversible 
Markov transition kernels of y-irreducible and aperiodic Markov chains, and as¬ 
sume that there exists g > 0 such that for any f E Lq(E, y) 

£n 2 (/) > e£ih(f) , 


then 


and 


Gap(n 2 ) > yGap(Lli) 

var(/, n 2 ) < {g^ 1 - l)var (U (/) + y _1 var (/, Hi) / E Ljj(E, t)- 
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We will need the following condition in the sequel: 

Condition 1. Defining A := {(x,y) G X 2 : r(x,y) > 1}, there exists a c such 
that 


i*^(x,y)£A Pk(x, y) — c. 

Proposition 1. Assume Condition 1. Then Lemma 3 holds with IIi = P, 
II2 = P, g = 7r and g = c d_1 . 

Proof. Since r(x, y) > 1, we have a(x, y) = 1. On the other hand, for (x, y) G 

A 


d(x,y) 


II 1 A pk(x,y) 


k =1 


,|{fc:p fc (a-,y)<l}| > d-1 


H PfcO, y) > > f? 

k:p k (x,y)< 1 


since at least one pk(x,y) > 1 whenever r(x,y) > 1. 

From Lemma 1, when (x,y) G A, we have 

a(y,x) = a(x,y)/r(x,y) > c d ~ 1 a(x,y)/r(x,y) = c d_1 o:(y,x) 

and thus a(x,y ) > c d ~ 1 a(x,y) for any (x,y) G X 2 . It follows that 

£p(f) = [ Tr{dx)P(x,dy) (f(x) - f{y)) 2 

Jx 

= [ Tr(dx)P(x,dy) a ^ X,y \ (/(a) - /(r /)) 2 

> c^ 1 f n(dx)P(x,dy) (/(x) - /(y)) 2 = c d_1 £:p(/), 
Jx 

and we conclude. 


□ 


The implication of this result is that, if P admits a right spectral gap, then so 
does P, whenever Condition 1 holds. Furthermore, and irrespective of whether 
or not P admits a right spectral gap, quantitative bounds on the asymptotic 
variance of MCMC estimates using (. X n )n>\ i n relation to those using {X n ) n >\ 
are available. 

2.4 Modification of a given factorisation 

The easiest way to use the above result is to modify any candidate factorisation. 
Given a factorisation of the function r 

d 

r (x,y) = Y\pk(x,y ), 

k =1 

satisfying the balance condition, we can define a sequence of functions p^ such 
that both r(x,y) = Ylt=i Pk( x ,y) and Condition 1 holds. To that effect, take an 

1 

arbitrary c G (0,1] and define b := c 1 - 1 . Then, if we set 

Pk(x,y ) ■■= min|^,max{6,p fc (x,y)}| , k G {1,..., d - 1}, 
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it then follows that one must define 

Pd(x,y) := 


r(x,y ) 

Uk=iPk(x,y) 


From this modification, we deduce the following result: 


Proposition 2. Using this scheme, Lemma 3 holds with IIi = P, II2 = P, 
p = 1 r and g = c 2 . 

PROOF. We note that infA^gx 2 nf=i 1 A Pk( x > u) > = c and that 

Pd{x,y ) = d ^[ X,V ) —- > 6 d_1 r(x,y) = cr{x,y). 

Uk=\ Pk(x,y) 

With A := {(x,y) € X 2 : r(x,y) > 1}, it follows that mi^ x y \ e A Pd(x,y) > c, and 
so mir x y \ e Aa(x,y) > c 2 . We conclude along the same line as in the proof of 
Proposition 1. □ 


While this modification ensures that one can take g = c 2 in Proposition 1, 
it is too general to suggest that using P can be more computationally efficient 
than using P when the cost of evaluating each p is taken into account. Indeed, 
Proposition 2 holds when the functions pk are chosen completely arbitrarily. Of 
course in practice, one should choose p\ z and hence pk so that they are in some 
sense in agreement with r. 

We will show in the next example that a certain class of pk s are beneficial, 
namely those which correspond to Metropolis-Hastings acceptance ratios with 
“flattened” surrogate target densities. On the other hand, it is far from difficult 
to come up with surrogate target densities for which unmodified use of pk can 
lead to disastrous performance. 


2.5 Example: unmodified surrogate targets 


One common usage (Christen and Fox, 2005) of Delayed Acceptance is to 
substitute a surrogate target 7r for it in p\(x,y). We consider the case d = 2 and 
a random walk proposal to examine Condition 1 in this context. Here we have 
q(x,y) = q(y,x ) and so 

a(x,y) = 1 A 


while 


Pi(x,y) = 1 A 


-x(y) 

tt(x) ’ 


p- 2 (x,y) = 1 A 


n(y)n(x) 
7r(x)7r (y)' 


Considering (x, y) € A = {(x, y) € X 2 : r(x, y) > 1} we require c > 0 satisfying 
simultaneously 

n(y) n{y)Tt(x) 

7r(x) — ’ 7r(x)7T (y) — 

The first of these says that 7f(y)/7r(x) cannot be too small when n(y) > ir(x). 
The second says that Tt(y)/Tt(x) should not be a large multiple of Tr(y)/-7r(x). 
There are a large variety of choices of 7r that allow one to take c = 1. For 
example, tt(x) = ir(x) + C for some constant C > 0 and tt(x) oc 7 t ( x )^ for some 
/3 e [0, 1]. Note that /3 = 0 corresponds to 7r being a constant function and (5 = 1 
corresponds to 7r oc 7 r. In between, one can think of 7f as being a flattened version 
of 7 r. 
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2.6 Counter-example: failure to reproduce geometric ergodicity 

Consider the case 7r(x) = AA(x;0,1) and 7 r(x) = Af (x;0, a 2 ) with Q(x, •) a 
normal distribution with mean x and fixed variance for each 16 R. Here we have 


pi(x,y) = exp 


(x 


y)(x + y) 

2a 2 


P 2 (x,y) = exp 


(1 ~ ° 2 )(y ~ x)(y + x) 
2a 2 


Mengersen and Tweedie (1996) showed that a random-walk Metropolis-Hastings 
chain for targets with super-exponential tails is geometrically ergodic. We now ex¬ 
ploit this result to derive that, if a 2 < 1, then the unmodified delayed acceptance 
Markov chain is not geometrically ergodic. 


Proposition 3. The unmodified Delayed Acceptance Markov chain using the 
factorisation into p\ and P 2 as above is not geometrically ergodic when a < 1. 

Intuitively, when x is large P(x,(— 00 , x)) ~ ^ but linx^oo _P(x, {x}) = 1 
because p\(x,y) takes on smaller and smaller values for y > x and p 2 (x,y) takes 
on smaller and smaller values for y < x. 


Proof. From Roberts and Tweedie (1996, Theorem 5.1), it suffices to show 
that 7T—essinf xg x P{x, {x}^) = 0, i.e. that for any r E (0,1) we can find A C X 
such that 7 t(A) > 0 and sup^g^ P(x, {x}^) < r. Let B s (z) denote the ball of 
radius s around z. Given t E (0,1), we define 

r := sup{s > 0 : Q(x, B s (x)) < r/3}, 


and 

Then 


A := 


x : x > - — 


cr 2 log (r/3) 
r(l — a 2 ) 




P(x,{x} C ) = P(x, B r (x) \ {x}) + P(x, Br(x)) 


< 


/ Q(x,dy)a(x,y) 


3 J 

B$(x) 

< 

2 T 

NT + 

/ Q(x,dy)a(x,y) 


3 

J Rp(a;)nR+ 


2t 

sup d{x,y) 

< 

RT + 


3 

j/£Sp(o;)nIR-)- 

= 

2t 

NT + 

sup [1 Api(x,y)\ [1 Ap 2 (x,y)\ 


3 

y&B^.(x) nK+ 


Now let x E A, y E B^(x) nM+ and assume y < x. It follows that P 2 (x, y) attains 
its maximum when y = x — r and therefore 


P 2 {x,y) < exp 


(1 — <r 2 )r(r — 2x) | 

2^2 J 

(1 — a 2 )r 2a 2 log(r/3) 
2c 2 r(l — a 2 ) 


T 

3' 


< exp 










DELAYED ACCEPTANCE FOR METROPOLIS-HASTINGS 


11 


Similarly, let x G A, y G B^(x) n M+ and assume y > x. It follows that pi(x,y) 
attains its maximum when y = x + r and therefore 


Pi{x,y) < exp 

< exp 

< exp 


r(2x + r) 'I 

2cP J 

_ T _( 0r _ 2 ° 2 l °s(r/3) \ \ 

2a 2 V r(l — cr 2 ) Jj 

r f 2 cr 2 log(r/3) \ It 
2 a 2 V r(l — cr 2 ) ) J “ 3’ 


since log(r/3) < 0 and a 2 < 1. Therefore, 


T 

sup [lApi(x,j/)] [lA/) 2 (a:,2 /)]<- 

j/es^( 3; )nR + ^ 


so P(x, {.t}^) < r and we conclude. 


□ 


The same argument can be made for much more general targets and proposals, 
albeit at the expense of brevity and clarity. We refrain from such a generalisation 
as our purpose here is to demonstrate that the DA chain may fail to inherit 
geometric ergodicity and that the simple proposed modification of the Delayed 
Acceptance kernel provided in Section 2.4 allows one to avoid this. 


3. OPTIMISATION 

When considering Markov Chain Monte Carlo methods in practice, their effi¬ 
ciency as measured by mixing properties and computational cost is a fundamental 
issue. This section addresses both perspectives in connection with Delayed Ac¬ 
ceptance. Section 3.1 examines the proposal distribution and derives its optimal 
scaling from standard random-walk Metropolis-Hastings theory. Then Section 
3.2 covers the ranking of the factors pi, which drives the total computational cost 
of the procedure. 

3.1 Optimising the proposal mechanism 

The explorative performances of a random-walk MCMC are strongly depen¬ 
dent on its proposal distribution. As exemplified in Roberts et al. (1997), finding 
the optimal scale parameter does lead to efficient ‘jumps’ in the state space and 
the overall acceptance rate of the chain is directly connected to the average jump 
distance and to the asymptotic variance of ergodic averages. This provides prac¬ 
titioners with an approach to ‘auto-tune’ the resulting random-walk MCMC al¬ 
gorithm. Extending this calibration to the Delayed Acceptance scheme is equally 
important, on its own ground towards finding a reasonable scaling for the proposal 
distribution and to avoid comparisons with the standard Metropolis-Hastings 
version. 

The original framework of Roberts et al. (1997) is cantered on estimating 
a collection of expected functionals, say g, where a plausible criterion for the 
performances of the MCMC is the minimisation of the stationary integrated auto¬ 
correlation time (ACT) of the Markov chain, defined as 

OO 

t 9 = 1 + 2 CorMX 0 ),g(Xi)) 

i=1 
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where the index g stresses the dependence on the considered functional, which is 
connected to the asymptotic variance through var (P,g) = r g x var n (g) whenever 
the chain is ^-irreducible, aperiodic, and reversible, var n (g) is finite and g £ 
L \n). 

Research on this optimisation focus on two main cases: 

Consider the limit in the dimension of the target distribution toward oo, 
where Roberts et al. (1997) gave conditions under which each marginal 
chain converges toward a Langevin diffusion. Maximising the speed of that 
diffusion, say h(£) where £ is a parameter of the scale of the proposal, implies 
a minimisation of the ACT and also that r is free from the dependence on 
the functional, defining thus an independent measure of efficiency for the 
algorithm; 

Sherlock and Roberts (2009) focus on unimodal elliptically symmetric tar¬ 
gets and show that a proxy for the ACT in finite dimensions is the Expected 
Square Jumping Distance (ESJD), defined as 

E it*; - *> 2 

,i=i ^ 1 

where X and X' are two successive points in the chain and || • represent 
the norm on the principal axes of the ellipse rescaled by the coefficients /3i 
so that every direction contributes equally. 

An interesting result in Sherlock and Roberts (2009) is that, as d —> oo, the 
ESJD on one marginal component of the chain converges with the same speed as 
the diffusion process described in Roberts et al. (1997). These authors furthermore 
show the asymptotic result holds for rather small dimension, roughly starting 
from d = 5. 

Moreover, when considering efficiency for Delayed Acceptance, which is a tech¬ 
nique tailored on costly computations, we need to focus on the execution time of 
the algorithm as well. We then proceed to define our measure of efficiency as 

(5) Eff := ESJD/cost per iteration 

similarly to Sherlock et al. (2013) for Pseudo-Marginal MCMC. 

Due to the complex acceptance ratio in Delayed Acceptance, an extension 
of the previous results requires rather stringent assumptions, albeit providing a 
proper guideline in practice. Section 5 will further demonstrate optimality ex¬ 
tends beyond those conditions. Note that our assumptions are quite standard in 
the literature on the subject. 

(HI) We assume for simplicity’s sake that the Delayed Acceptance procedure 
operates on two factors only, i.e., that r(x, y) = pi(x, y ) X p 2 {x , y). The acceptance 
probability of the scheme is thus 

2 

a(x,y) = JJ(1 Api(x,y)). 

2=1 

We also consider the ideal setting where a computationally cheap approximation 
/(•) is available for 7r(-) and precise enough so that p 2 {x,y) = r(x,y)/p±(x,y) = 


E[\\X'-X\\}] = E 
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n(y)/n(x) x f(x)/f(y) = 1. 

(H2) We further assume that the target distribution satisfies (Al) and (A2) in 
Roberts et al. (1997), which are regularity conditions on n and its first and second 

n 

derivatives, and that 7r(x) = II 

2=1 

(H3) We consider a random walk proposal y = x + \[Wfd Z. where Z ~ jV(0, Id). 
Note that Gaussianity can be easily relaxed to distributions with finite fourth mo¬ 
ment and similar results are available for more heavy-tailed distributions (Neal 
and Roberts, 2011). 

(H4) Finally we assume that the cost of computing /(•), say c, is proportional to 
the cost of computing vr(-), named C , with c = 5C. 


Normalising costs by setting (7 = 1, the average total cost per iteration of the 
Delayed Acceptance chain is 5 + IE [a] and the efficiency of the proposed method 
under the above conditions is 


Eff(<5, i) 


ESJD 
5 + E [a] 


Lemma 4. Under the above conditions (H1-H4) on the target tt(x), on the 

2 

proposal q(x,y) and on the factorised acceptance probability a(x,y) = D[(l A 

2=1 

Pi(,x,y)) we have that 

a(x,y) = (1 A pi(x,y)) 


and that as d —> oo 


Eff (6, £) = 


2£ 2 $(_M) 

h + E[d] J + 2 $(-^£) 


m 


where I := E 


(^G))' 

7 t(x) 


»(0 = E[a]=24.(-AZ) 

as defined in Roberts et al. (1997). 


Proof. It is easy to see that (HI) implies /(•) = 7r(-) and so p\(x,y) = 
r(x,y). Moreover, by definition, p 2 (x,y) = r(x,y)/p\(x,y) = 1 and hence the 
second test is always accepted. The acceptance rate reduces then to just the ratio 

f(y)/f 0*0 = Pi(x,y). 

The second part of the lemma follows directly from Theorem 1.1 in Roberts 
et al. (1997). □ 

Let us stress that almost all assumptions in the above Lemma can be relaxed 
and that performances are robust against small deviances from those assumptions, 
as shown by the literature on standard Metropolis-Hastings. Obtaining analytical 
results without such conditions, while possible, requires however a considerable 
mathematical effort. 

We now state the main practical implication of Lemma 4. 
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Figure 3. Two top panels: behaviour of£*(5) and a* (5) as the relative cost varies. Note that for 
8 » 1 the optimal values converges towards the values computed for the standard Metropolis- 
Hastings (dashed in red). Two bottom panels: close-up of the interesting region for 0 < <5 < 1. 



Proposition 4. If the conditions of Lemma f holds, the optimal average 
acceptance rate a* (5) is independent of I. 

Proof. Consider E S(S,£) in terms of (5,a(£)): 

eVI\ 


a = g(£) = 2$ — 


*-M - 2 ) Ti 


Eff (6, a) = 




6 + a 


5 + a 


$ 


-l 


2 a 


where we dropped the dependence on £ in a for notation’s sake. It is now evident 
that to maximise Eff(d, a) in a we only need maximise | < h~ 1 (| ) 2 a |,which 

is independent of I. □ 


The optimal scale of the proposal £*(6) and the optimal acceptance rate a* (5) 
are thus given as functions of 5. In particular, as the relative cost of computing 
pi(x,y) with respect to p 2 (x,y) decreases, the proposed moves become bolder, 
in that £* increases and a* decreases, since rejecting costs the algorithm little 
in terms of time, while every accepted move results in an almost independent 
sample. On the contrary when 6 grows larger the chain rapidly approaches a 
Metropolis-Hastings behaviour, as it is no longer convenient to reject early. Fig¬ 
ure 3 helps visualise the result. 
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3.2 Ranking the Blocks 

As mentioned at the end of Section 1, the order in which the factors Pi(x, y ) are 
tested has a strong influence on the performance of the algorithm. Delayed Accep¬ 
tance was first developed in Christen and Fox (2005), Fox and Nicholls (1997) to 
speed up computations using a cheap approximation ir(-) of the target distribution 
7r(-) as a first step before computing the actual, and costly, Metropolis-Hastings 
ratio ir(y)/Tr(x) only in the cases where the acceptance test based on the approx¬ 
imation 7r was satisfied. The main idea, namely to avoid the computation of the 
most costly parts as often as possible, remains relevant even for factorisations 
composed of more than two terms. 

Consider an i.i.d. framework; the target (in x) is given by 

n 

ir(x\Z) oc p(x) x L{Z\x ) = p(x\Z ) x f(zi\x) 

i= 1 

where Z = (zi,...,z n ) is an i.i.d. sample from f(z\x ) and p{x) is the prior 
distribution for x , we can always consider the decomposition 

K 

( 6 ) r(x,y) = Y[&{x,y) 

2=1 


where each £j(x,y) is made of a small number of density ratio terms, with one 
including the prior and proposal ratios. In the limit, it is feasible if not necessarily 
efficient to consider the case K = n + 1 with 


&(x,y) 


f(zj\y) 

f(zi\x) 


* = !,■■■ ,n and £n+i (x,y) 


p(y)q(y,x) 

p(x)q(x,y)' 


Assuming the computing cost is comparable for all terms, a solution for opti¬ 
mising the order of these factors ranks the entries according to the success rates 
observed so far, starting with the least successful values. Alternatively, the fac¬ 
torisation can start with the ratio that has the highest variance, since it is the 
most likely to be rejected. (Note however that poor factorisations (6) lead to 
very low acceptance rates, as for instance when picking only outliers in a given 
group of observations.) Lastly, we can rank factors by their correlation with the 
full Metropolis-Hastings ratio; taking the argument to the limit, if the first fac¬ 
tor has a perfect correlation with r(-, ■) then all the successive terms must be 
accepted and their order is hence of no interest. 

This later setting is akin to considering the hypothetical optimal solution in¬ 
troduced in Section 3.1 with only two terms in the decomposition. Let a small 
number of the best scoring terms be merged to form p\ and let the remain¬ 
ing factors become p2 • pi(x,y) = 7r(y)/7f(x) is then highly correlated to r(x,y ), 
P 2 {x,y) 1 for every (x, y) and hence n(x) is a close approximation of the target, 

albeit probably flattened, which is exactly what we want (see Section 2). 

As all these features can be evaluated for each subsample while running a 
chain with acceptance ratio factored as in (6), an implementation based on this 
intuition is then to take 


nz* (*) oc p(x) m/n f(z*\x) 
2=1 


m 

n z *(x) oc Y[f(z*\x) 
2=1 


or 
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with m < n, where Z* = (z*,..., zffi) is a subsample of Z. At each itera¬ 
tion t of the Markov chain we compute all the ^,2/) and Z*^ is cho¬ 

sen as the subset that maximise the observed correlation between the values of 
(j = 1,... , t — 1) and the full Metropolis-Hastings ratio (or whatever 
other selected criterion). As computing the real argmax^» c ^ is expensive, in our 
practical implementation we resort to a forward selection scheme; starting with 
the factor with the maximal correlation we build merging one term at a 
time until a desired correlation level is achieved, the observed correlation after 
including another term does not grow more than a small e > 0 or the size of Z*^> 
has reached a critical point for computational purposes (e.g. 10% of the whole 
sample Z). 

A relevant warning is that if we rearrange terms during the run, not only 
reordering but also merging them, in accordance to their correlation with the 
unmodified ratio, the resulting method has no theoretical guarantee since the 
kernel is potentially changing at each iteration depending on properties of previ¬ 
ous samples (Gelfand and Sahu, 1994). 

As with standard adaptive MCMC (Roberts and Rosenthal, 2005) we resort 
thus to a finite adaptation scheme ; we start with a fixed number of iterations to 
rank and rearrange the factors, followed by a fixed ordering to achieve ergodicity 
of the chain. We test this procedure in Section 5.1 on a simulated example. 

Finally note that while we focused on the i.i.d. setting, in more complex cases 
where the ratio is factored and Delayed Acceptance can be applied, it is often 
the case that the optimal ordering of such factors is already known. 

4. RELATION WITH OTHER METHODS 
4.1 Delayed Acceptance and Prefetching 

Prefetching, as defined by Brockwell (2006), is a programming method that 
accelerates the convergence of a single MCMC chain by guessing future states in 
the path of a random walk Metropolis-Hastings Markov chain in order to use any 
additional computing power available, in the form of extra parallel processors, to 
calculate in advance necessary quantities (like the Metropolis-Hastings ratio) so 
that when the chain reaches a given state the computationally-heavy part of that 
iteration are ready. 

Clearly the usefulness of this technique depends on our ability to guess the path 
of the chain correctly and hence many advanced prefetching strategies make use 
of the observed acceptance rate of the chain or even of a fast approximation if of 
the target distribution to select the most likely future outcomes. 

Since an in-depth exploration of prefetching is outside the scope of this work the 
reader is referred to Strid (2010) and citations therein for a complete discussion 
of the argument. 

As mentioned above and demonstrated in Angelino et al. (2014), Strid (2010) if 
a cheap approximation if of the target density is available, it can be used to select 
more likely future paths of the chain and this results in an efficient prefetching 
algorithm. 

In our case the master process sequentially samples from the (Delayed Ac¬ 
ceptance) chain by checking only the (assumed) inexpensive first approxima¬ 
tion pi(x,y) = if (i/)/tt(x) while the other additional processors provide him the 
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more expensive p 2 {x,y) = n(y)Tr(x)/n(x)n(y) computed beforehand thanks to 
prefetching. The theoretical properties of the chain are unchanged while the 
achievable speed-up may be substantial, especially for the first few additional 
processors. 

4.2 Alternative procedure for Delayed Acceptance 

In the case that every factor pt{x, y ) has roughly the same computational cost, 
Philip Nutzman suggested (personal communication) that Delayed Acceptance 
can be slightly modified by taking the overall acceptance probability 

d ( k 

1 [ min {pi(x,y), 1} to be instead min <1 \ Pi{x,y), 1 

i=l ’ U=1 

Such a decomposition follows from the same idea that one would like to com¬ 
pute as few factors as possible once one realizes that the proposal is likely to 
be rejected. Under this modification the associated Markov chain still achieves 
the correct target in the stationary regime and the procedure satisfies detailed 
balance, provided the ordering of the terms is uniformly random. 

An interesting consequence of this modification is that, as the number of fac¬ 
tor increases, the acceptance rate eventually stabilises, while for the method de¬ 
scribed in Section 1 the acceptance rate decreases to zero. This property is indeed 
appealing, even thought this procedure logically takes longer to complete when 
compared with the standard Delayed Acceptance (albeit less than the reference 
Metropolis-Hastings procedure). 

The evident disadvantage of the modification in a general setting is that de¬ 
tailed balance implies that the factors are computed in a random order at each 
iteration, making vain any attempt to adapt in terms of the ordering (Section 
3.2) or to set the order based on respective computational costs. 

This drawback can be somewhat reduced by combining the above two ap¬ 
proaches; consider the decomposition 




" di 


di 

n(y)q(y,x)/ir(x)q(x,y) = 

_i=1 

X 

n 4>j(x,y) 

j =i 


where d\ + c ?2 = d and the factors pi and f>j represent respectively cheap factors 
and costly factors. By taking now 


min 

(m=l,... ,di) 


l,Y[pi{x,y) 


i=1 


X min 

(fc=l,...,d 2 ) 


k \ 

i, > 

3 = 1 J 


the algorithm computes cheap factors first and expensive factors last, applying 
the symmetry requirement to satisfy detail balance inside each of both subsets. 
Clearly the above can be generalised to a larger number of subsets, each with 
di factors in it. Intuitively, this last modification can be explained as an early 
rejection of each of the intermediate acceptance/rejection steps inside a Delayed 
Acceptance scheme. 


Remark 3. Interestingly if di = 1 VZ (l being the number of subsets consid¬ 
ered) this procedure reduces to Delayed Acceptance, and for l that increases and 
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di > 1 V/ this combined technique will have a even lower overall acceptance rate 
than standard Delayed Acceptance. 

4.3 Delayed Acceptance and Slice Sampling 

As a final remark, we stress another analogy between our Delayed Acceptance 
algorithm and slice sampling (Neal, 1997, Robert and Casella, 2004). Based on 
the same decomposition (1), slice sampling proceeds as follows 

1. simulate u\, ..., u n ~ 77(0,1) and set A * = Ui£(0\xi) (i = 1,..., n); 

2. simulate O' as a uniform under the constraints £i(0'\xi) > A* (i = 1,... ,n). 

to compare with Delayed Acceptance which conversely 

1. simulate O' ~ q(Q'\Q)\ 

2. simulate u\,.. ., u n ~ 77(0,1) and set \ = Uii{0\xi) (i = 1,..., n); 

3. check that ^(f^lxj) > Aj (i = 1,..., n). 

The differences between both schemes are thus that (a) slice sampling always 
accepts a move, (b) slice sampling requires the simulation of O' under the con¬ 
straints, which may prove infeasible, and (c) Delayed Acceptance re-simulates the 
uniform variates in the event of a rejection. In this respect, Delayed Acceptance 
appears as a “poor man’s” slice sampler in that values of O's are proposed until 
one is accepted. 


5. EXAMPLES 

To illustrate the improvement brought by Delayed Acceptance, we study three 
different realistic settings that reflect on the generality of the method. First, in 
Section 5.1 we consider a Bayesian analysis of a logistic regression model, to assess 
the computational gain brought by our approach in a “Big-Data” environment 
where obtaining the likelihood is the main computational burden. Secondly (Sec¬ 
tion 5.2) we examine a high dimensional toy Normal-Normal model, sample with 
a geometric Metropolis adjusted Langevin algorithm where the main computa¬ 
tional cost comes from the proposal distribution which is position specific and 
involves derivatives of the density up till the third level, which are computed nu¬ 
merically at each iteration. Finally in Section 5.3 we investigate a mixture model 
where a formal Jeffreys prior is used, as it is not available in closed-form and does 
require an expensive approximation by numerical or Monte Carlo means. The lat¬ 
ter example comes as a realistic setting where the prior itself is a burdensome 
object, even for small datasets. 

5.1 Logistic Regression 

While a simple model, or due to its simplicity, logistic regression is widely 
used in applied statistics, especially in classification problems. The challenge in 
the Bayesian analysis of this model is not generic, since simple Markov Chain 
Monte Carlo techniques providing satisfactory approximations, but stems from 
the data-size itself. This explains why this model is used as a benchmark in some 
of the recent accelerating papers (Korattikara et al., 2013, Neiswanger et ah, 
2013, Scott et ah, 2013, Wang and Dunson, 2013). Indeed, in “big Data” setups, 
MCMC is deemed to be progressively inefficient and researchers are striving to 
keep simulation effective, focusing mainly on parallel computing and on sub¬ 
sampling but also on replacing the classic Metropolis scheme itself. 
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Algorithm 

relative ESS (aver.) 

relative ESJD (aver.) 

relative Time (aver.) 

DA-MH over MH 

1.1066 

12.962 

0.098 


Algorithm 

relative Eff gain (ESS) (aver.) 

relative Eff gain (ESJD) (aver.) 

DA-MH over MH 

5.47 

56.18 


Table 1 

Comparison between MH and MH with Delayed Acceptance on a logistic model. ESS is the 
effective sample size, ESJD the expected square jumping distance, time is the computation 

time. 


We tested the proposed method against the standard Metropolis-Hastings al¬ 
gorithm on 10 6 simulated data with a 100-dimensional parameter space. The pro¬ 
posal distribution is Gaussian: y\x ~ A f(x, X) with X initialised to be 0.2 x Iy (d 
being the dimension of the parameter space) and then adapted. The Metropolis- 
Hastings benchmark was made adaptive by targeting the asymptotic optimal 
acceptance rate of a* = 0,234 (Roberts et al., 1997). 

Delayed Acceptance was optimised first against the ordering of the factors as 
explained in Section 3; we split the data into subsamples of 10 elements and 
computed their empirical correlation with the full Metropolis-Hastings ratio as a 
criterion. Once these estimates were stable we merged into the surrogate target / 
the smallest number of subsamples needed to achieve a > 0, 85 correlation with 
r(x, y ). As soon as the ordering was fixed we computed 5, the relative cost of the 
obtained p\ with respect to the full ratio, and run the chain for the remaining 
iterations optimising X against the optimal acceptance rate found through (5). 
We also added the modification explained in Section 2.4 with c set such that b 
was slightly lower than the optimal acceptance rate above. 

We collected 100 repetitions of the experiment and the results are presented in 
Table 1. Before commenting the results we highlight the fact that this situation 
may seem not particularly appealing for Delayed Acceptance and in fact straight 
application of the method by randomly choosing the composition of p\ and p 2 may 
lead to variable results. Further coding effort is required here in order to choose 
adaptively how to split the MH ratio. Borrowing from both Section 3.2 and the 
end of Section 3.1, i.e. by choosing during the brief burn-in of the chain which 
subset best represents the whole likelihood and then, based on how populated 
that subset is, targeting a specific acceptance ratio, produces both a completely 
automated MCMC version for this kind of data (iid) and better results under a 
time constraint. 

As shown in Table 1, while the assumption made in Section 3 not completely 
satisfied, the relative efficiency of Delayed Acceptance is higher that for MH by 
a factor of almost 6. We measured efficiency trough effective sample size (ESS, 
from the coda R package (Plummer et ah, 2006)) or expected square jumping 
distance (ESJD). By choosing the first subsample to be informative on the whole 
ratio there is practically no loss on ESS (while the estimated ESJD actually 
increased) and, given the significantly reduced acceptance rate, the computing 
time is usually less then a fourth of the computing time of the corresponding 
optimal MH, taking into account the first part of chain used to determine the 
blocks ranking. 
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5.2 G-MALA with Delayed Acceptance 

5.2.1 MALA and Geometric MALA: Random walk Metropolis-Hastings, while 
generic and popular, can struggle with posterior distributions in high dimensions 
or in the presence of high correlation between some components. In such cases it 
is inefficient, with low acceptance rate, poor mixing and highly correlated sam¬ 
ples. Metropolis adjusted Langevin algorithm (MALA, see for instance Roberts 
and Strainer (2002)) has been devised to overcome these difficulties by taking 
advantage of the gradient of the target distribution in the proposal mechanism, 
making the Markov chain more robust with respect to the dimension of the prob¬ 
lem and proposing broader moves with higher probability. MALA is based on a 
Langevin diffusion, with the target (the posterior distribution n(0\y) in our case) 
as a stationary distribution, defined by the SDE 

dO ..dt dB 

s =V 9 log «%))- + — 

where B is a Brownian motion. Using a first-order discretisation the diffusion 
gives the following proposal mechanism: 

0' = 0^~ l ) + e 2 V 0 log(n(d^\y))/2 + eZ 

where e is the step-size for the Euler’s integration and Z ~ jV(0,1). This discreti¬ 
sation is then compensated by introducing an accept/reject probability similar 
to a Metropolis-Hastings algorithm. 

This diffusion is isotropic and will hence still be inefficient for highly corre¬ 
lated components or with very different scales, as the step size e is fixed across 
dimensions. Roberts and Strainer (2002) propose to alleviate the issue using a 
pre-conditioning matrix A so that the proposal becomes 

0’ = flC*" 1 ) + £ 2 A T A\7 e log(vr(^- 1 )|y))/2 + eAZ. 

Christensen et al. (2003) demonstrate however that defining this matrix in gen¬ 
eral can be difficult and that tuning on the go may result in an inappropriate 
asymptotic behaviour. 

In a recent work Girolami and Calderhead (2011) propose the Geometric- 
MALA in order to overcome this difficulty, advising the use of a position specific 
metric for the matrix A, which takes advantage of the geometry of the target space 
that the chain is exploring. They suggest in particular the Fisher-Rao metric ten¬ 
sor. In terms of Bayesian inference, where the target distribution is the posterior 
density, this choice translates into A T A being the expected Fisher information 
matrix plus the negative Hessian of the log-prior. 

This theoretically efficient solution also performs well in practice but comes 
with a serious computational burden in the fact that at every evaluation of the 
Metropolis-Hastings ratio derivatives up till the third order of our log-target 
distribution are needed and, in the event of them being analytically not available, 
expensive numerical approximations are to be computed (see equation (10) of 
Girolami and Calderhead (2011)). 

5.2.2 Sampling with Delayed Acceptance and GMALA: Geometric-MALA rep¬ 
resent a perfect application for Delayed Acceptance since we can naturally divide 
its acceptance ratio into the product of the posterior ratio and the ratio of the 
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proposals, the latter to be only computed when the proposed point is associated 
with a relatively large posterior probability. 

As described above, the computational bottleneck of the G-MALA lays in the 
computation of the third derivative of our log-target at the proposed point, while 
the computation of the posterior itself has usually a low relative cost. More¬ 
over even with a non-symmetric efficient proposal mechanism (the discretised 
Langevin diffusion) G-MALA is still close to a random walk and we expect the 
ratio of the proposal to be near 1 , especially at equilibrium especially when e is 
small. Therefore, the first ratio is inexpensive, relative to the second one, while 
the decision reached at the first stage should be consistent with the overall ac¬ 
ceptance rate. 

Given that optimal scaling for MALA in terms of the dimension d of the target 
differs from the random-walk setting (see Roberts and Rosenthal, 2001), we set 
the variance of the random-walk normal component as a% = ^ 173 - Borrowing 
from Section 3.1, we can obtain the optimal acceptance rate for the DA-MALA, 
through Equation (5), by maximising 


EfF (<5, i) 


Hi) 

6 + E [a] - 5 x E [a] 


2£ 2 3>{-^f) 

5 + 2^>{-^f) x (1 — <5) 


or equivalently 

6 + a(l — J) 

In the above the computational cost per iteration is taken to be c = 5C for the 
posterior ratio, C = 1 for the proposal ratio (and hence c + E [a] (C — c) for the 
whole kernel), h(£) is again the speed of the limiting diffusion process and K is 
a measure of roughness of the target distribution, depending on its derivatives. 
Since the optimal a* is independent from K, we do not define it more rigorously, 
referring to Roberts and Rosenthal (2001). Figure 4 shows that a* decreases with 
5, as is the case with random-walk Metropolis-Hastings. It reaches the known 
optimum for the standard MALA when <5 = 1. 

5.2.3 Simulation study: To test the above assumptions we ran a toy MALA 
example where we drew 100 samples from a J\Td(0,I), with d = 10; tt(9) was 
set to be A/rf(0,100). Figure 5 presents an example run. We then repeated the 
experiment 100 times and computed an average efficiency gain, defined either as 
ESS or as the ESJD, over the computing time. We computed 6 at each run by 
averaging a few computed derivatives, required by the proposal ratio. We then 
adapt e to get the optimal acceptance rate, being conservative in order to avoid 
overflow issues with the first-order numerical integrator. Results are presented 
in Table 2. Delayed Acceptance exhibits improvement by a factor of 10 in this 
example, obtained almost for free in terms of to coding time. 

5.2.4 HMC with Delayed Acceptance: As a side note, while the reasoning 
applied to MALA does theory apply to Hamiltonian Monte Carlo (HMC), the 
computational gain obtained through Delayed Acceptance is only connected with 
avoiding some proposal computations. In a general HMC though (with both 
point-dependent and independent pre-conditioning matrices), proposing a new 


Eff(M = -(| X3 
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Figure 4. Optimal acceptance rate for the DA-MALA algorithm as a function of 5. In red, the 
optimal acceptance rate for MALA obtained by Roberts and Rosenthal (2001 ) is met for <5=1. 



0.0 0.2 0.4 0.6 0.8 1.0 

6 


Algorithm 

ESS (aver.) 

ESS (sd) 

ESJD (aver.) 

ESJD (sd) 

time (aver.) 

time (sd) 

MALA 

7504.486 

107.21 

5244.946 

983.473 

176078 

1562.3 

DA-MALA 

6081.023 

121.42 

5373.253 

2148.761 

17342.91 

6688.3 


Algorithm 

a (aver.) 

ESS/time (aver.) 

ESJD/time (aver.) 

MALA 

0.661 

0.04 

0.03 

DA-MALA 

0.09 

0.35 

0.31 


Table 2 

Comparison between standard geometric MALA and geometric MALA with Delayed 
Acceptance, with ESS the effective sample size, ESJD the expected square jumping distance, 
time the computation time and a the observed acceptance rate. 
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Figure 5. Comparison between geometric MALA (top panels) and geometric MALA with De¬ 
layed Acceptance (bottom panels): marginal chains for two arbitrary components (left), estimated 
marginal posterior density for an arbitrary component (middle), ID chain trace evaluating mix¬ 
ing (right). 
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value still involves the computation of L — 1 (with L the number of steps in 
the discretised-Hamiltonian integration) derivatives, as only the starting point 
is recovered from the previous iteration, while computing the final Metropolis- 
Hastings ratio involves just the extra computation at the end point. Therefore, 
in this setting, the computational gain is much reduced. 

5.3 Mixture Model 

5.3.1 Non-Informative inference on a Mixture Model: Consider a standard 
mixture model (MacLachlan and Peel, 2000) with a fixed number of components 

k k 

(7) ^2wif(x\0i), with y~] Wj = f. 

2=1 2=1 

This standard setting nonetheless offers a computational challenge in that the 
reference objective Bayesian approach based on the Fisher information and the 
associated Jeffreys prior (Jeffreys, 1939, Robert, 2001) is not readily available for 
computational reasons and has thus not been implemented so far. Proxys using 
Jeffreys priors on the components of (7) have been proposed instead, with the 
drawback that since they always lead to improper posteriors, ad hoc corrections 
have to be implemented (Diebolt and Robert, 1994, Roeder and Wasserman, 
1997, Stephens, 1997). 

When relying instead on dependent improper priors, it is not always the case 
that the posterior distribution is improper. For instance, Robert and Tittering- 
ton (1998) provide a location-scale representation that allows for some improper 
prior. In the current paper, we consider instead the genuine Jeffreys prior for the 











24 


complete set of parameters in (7), derived from the Fisher information matrix for 
the whole model. While establishing the analytical properness of the associated 
posterior is beyond the goal of the current paper, we handle large enough samples 
to posit that a sufficient number of observations is allocated to each component 
and hence the likelihood function dominates the prior distribution. (In the event 
the posterior remains improper, the associated MCMC algorithm should exhibit 
a transient behaviour.) 

Therefore, this is an appropriate and realistic example for evaluating Delayed 
Acceptance since the computation of the prior density is clearly costly, relying 
on many integrals of the form: 


( 8 ) 


d 2 log J2i=l W if( X \ 0 i) 


'X 


dO h d6j 


k 

Y Wif(x\6i ) 

_i= 1 


dx. 


Indeed, these integrals cannot be computed analytically and thus their derivation 
involve numerical or Monte Carlo integration. This setting is such that the prior 
ratio—as opposed to the more common case of the likelihood ratio—is the costly 
part of the target evaluated in the Metropolis-Hastings acceptance ratio. More¬ 
over, since the Jeffreys prior involves a determinant, there is no easy way to split 
the computation in more parts than “prior x likelihood”. Hence, the Delayed 
Acceptance algorithm can be applied by simply splitting between the prior p J (ip) 
and the likelihood l(il>\x) ratios, the later being computed first. Moreover, since 
the proposed prior is “non informative”, its influence on the definition of the pos¬ 
terior distribution should be small with respect to the likelihood function and, 
then, computing the likelihood ratio first should not have a substantial impact 
on the acceptance rate. However, the improper nature of the prior means using 
a second acceptance ratio solely based on the prior can create trapping states 
in practice, even though the method remains theoretically valid. We therefore 
opted for stabilising this second step by saving a small fraction of the likelihood, 
corresponding to 5% of the sample, to regularise this second acceptance ratio. 
This choice translates into Algorithm 2. 


Algorithm 2 Metropolis-Hastings with Delayed Acceptance for Mixture Models 

Lp™J n 

Set^ 2 (-|x)= n ^(-|*;) and h{-\x) = JJ -£(-|xi) where p £ (0,1) 

i= 1 i= \_pn\ +1 

1. Simulate if) 1 ~ 

2. Simulate ui,U 2 ~ U( 0,1) and set Ai = ui£i(ip\x)-, 

3. if £i(ip'\x) < Ai, repeat the current parameter value and return to 1; 
else set A 2 = U 2 £ 2 {tp\x)p J (ip); 

4. if £ 2 (ip'\x)p J (ip 1 ) > X 2 accept ip'; 

else repeat the current parameter value and return to 1. 


5.3.2 Simulation study: An experiment comparing a standard Metropolis- 
Hastings implementation with a Metropolis-Hastings version relying on Delayed 
Acceptance is summarised in Table 3. Data were simulated from the following 
Gaussian mixture model: 


(9) 


f(y\6) = 0.1(W(-10, 2) + 0.65AA(0, 5) + 0.25AA(15, 7). 
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Algorithm 

ESS (aver.) 

ESS (sd) 

ESJD (aver.) 

ESJD (sd) 

time (aver.) 

time (sd) 

MH 

1575.963 

245.96 

0.226 

0.44 

513.95 

57.81 

MH + DA 

628.767 

87.86 

0.215 

0.45 

42.22 

22.95 


Table 3 


Comparison using different performance indicators in the example of mixture estimation , based 
on 100 replicas of the experiments according to model (9) with a sample size n = 500, 10 5 MH 
simulations and 500 samples for the prior estimation. (“ESS” is the effective sample 
size, “time” is the computational time). The actual averaged gain ( J f ss ° A IESS MH > 9.58, 

higher than the “double average” that the table above suggests as being around 5. 


Both the standard Metropolis-Hastings and the Delayed Acceptance version 
are adapted against their respective optimal acceptance rate, which is computed 
to be 2%, given that 5 is empirically established to be 0.01 using 500 samples 
for the Monte Carlo estimation of the prior. As a consequence the MH+DA 
algorithm will produce less unique samples in the total 10 5 iterations of the 
chain, as reflected in the lesser ESS in Table 3, but this is counterbalanced by the 
impressive decrease in computing time, leading again to an overall gain in terms 
of ESS /t. of about 9. 


6. CONCLUSION 

We introduced in this paper Delayed Acceptance, a generic and easily imple¬ 
mented modification of the standard Metropolis-Hastings algorithm that splits 
the acceptance rate into more than one step in order to increase the compu¬ 
tational efficiency of the resulting MCMC, under the sole condition that the 
Metropolis-Hastings ratio can be factorised this way. 

The choice of splitting the target distribution into parts ultimately depends 
on the respective costs of computing the said parts and of reducing theoretically 
the overall acceptance rate and expected square jump distance (ESJD). Still, this 
generic alternative to the standard Metropolis-Hastings approach can be con¬ 
sidered on a customary basis, since it both requires very little modification in 
programming and can be easily tested against the basic version, both empiri¬ 
cally and theoretically by the results of (2). The Delayed Acceptance algorithm 
presented in (1) can significantly decrease the computational time per se as well 
as the overall acceptance rate. Nevertheless, the examples presented in Section 5 
suggest that the gain in terms of computational time is not linear in the reduction 
of the acceptance rate, especially in the presence of optimisation techniques like 

(3) ' 

Furthermore, our Delayed Acceptance algorithm does naturally merge with the 
widening range of prefetching techniques, in order to make use of parallelisation 
and reduce the overall computational time even more significantly. Most settings 
of interest are open to take advantage of the proposed method, if mostly in the 
situation of Bayesian statistics where the target density and/or the Metropolis- 
Hastings ratio always allow for a natural factorisation. The case when the likeli¬ 
hood function can be factorised in an useful way represents the best gain brought 
by our solution, in terms of computational time, and it may easily improve even 
more by exploiting parallelisation techniques. 
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